1 Introduction 


Analysis of software measurement data is essential for assessing the quality of
the software developed. Building prediction models for external attributes of
systems, or investigating particular relationships (e.g. between size and a
specific quality attribute), captures information about past processes that can
be used to help make judgements about existing or future ones [5].


Data Analysis provides a means to make software measures comparable and 
determine the characteristics of metrics and relationships among them [5]. 


Process Overview 


M-System is a tool suite for the measurement of systems that calculates a set of
measures (OO product measures) from the structural properties of a software
system. The process for applying M-System is presented below.


[Figure 1 diagram. Elements shown: Perceptions, What to measure, Quality
Definition, Test/Parsing results, Parsing, Source files, Manual code inspection,
Metrics, Change difficulty indices, Statistical Analysis, Metrics Result and
Application of results, with the supporting tools Parser, M-System, STATISTICA
(Macro) and Expert/Inspector. The legend distinguishes the main steps of the
process from relevant steps that are often not addressed explicitly.]


Figure 1: M-system process ([4], page 4) 


In the earlier stages of the process (see figure 1), the quality definition and
metrics specification activities yield the quality model and its associated
metrics. Parsing the source code and retrieving the measurement data provide the
necessary outcomes (M-System data) for the next stage, Statistical Analysis,
where the results are analysed in an appropriate way.


This manual describes the process of statistically analysing measurement data
from M-System to investigate the relationships between product measures and
external quality attributes, e.g. effort, fault-proneness, etc. Tool support for
this process is provided as well.


The manual is organized as follows. Section 2 introduces the process of
analysing M-System Data, which is described in detail in the following section.
Section 3 presents the data analysis process as well as some examples with the
macro developed for that purpose. STATISTICA [17] is used as tool support.
Section 4 summarises the results. Section 5 provides several resources about the
techniques and tools used for the same purpose. The handbook also includes an
Appendix with a brief introduction to the Tool (Macro), its installation and its
usage with STATISTICA.


2 Data analysis issues


Goals and data set 


Data analysis examines the provided data set according to the defined goals.
Before the data analysis can start, the goals of the study have to be defined
and the data have to be collected.


The goals of the study have to be defined during the quality definition phase
(see figure 1). Goals are preferably defined according to the GQM goal
measurement template. When the goals are defined and the metrics specified, a
set of hypotheses and the collection of variables that influence the hypotheses
are defined as well.


The data set results from M-System (M-System Data): all measures to be analysed
are extracted from the system under study with the M-System tool. The results
are stored in files that will be used later.


The results provided by M-System are stored in files according to the following
format.


class LCOM1 LCOM2 LCOM3 LCOM4 LCOM5 Coh Co LCC TCC 
5 3 0.93706 0.12987 0.08333 0.75000 0.42857 
3 1.00000 0.33333 -1.0000 0.00000 0.00000 


0.00000 1.00000 1.00000 1.00000 1.00000 0 
1 2.00000 0.00000 1.00000 0.00000 0.00000 
44 36 0.99693 0.01710 0.00227 0.08134 0.05754 


3 
2 2 1.46000 0.02666 -0.5000 0.33333 0.33333 
1 
2 


Figure 2: Output file format for cohesion metric. Filename: name.coh.dump 


These files will be used as input files for the Macro that implements the statisti- 
cal techniques described in the following sections. The first two columns de- 
scribe the system and the classes, to which the measures belong. The rest of 
the columns are the measures obtained from M-System. 
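The file layout described above can be read with a few lines of code. The following is only a minimal sketch in Python, outside the STATISTICA macro, and it rests on assumptions: the first row holds the variable names, the columns are separated by whitespace, and the first two columns are named sys and class. The function name parse_dump and the sample rows are hypothetical and are not taken from M-System.

```python
def parse_dump(text):
    """Parse whitespace-separated measurement output into a list of dicts.

    Assumed layout: a header row with the variable names, then one row per
    class; the first two columns identify the system and the class, and all
    remaining columns are numeric measures.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header):
            raise ValueError(
                f"expected {len(header)} fields, got {len(fields)}: {line!r}")
        # Keep the two identifying columns as text, convert the rest.
        row = {header[0]: fields[0], header[1]: fields[1]}
        for name, value in zip(header[2:], fields[2:]):
            row[name] = float(value)
        rows.append(row)
    return rows


# Hypothetical sample content, made up for illustration only.
SAMPLE = """\
sys class LCOM1 LCOM2 TCC
demo Account 5 3 0.42857
demo Ledger 44 36 0.05754
"""

records = parse_dump(SAMPLE)
for record in records:
    print(record["class"], record["LCOM1"], record["TCC"])
```

Rows whose number of fields differs from the header are rejected rather than silently shifted, because a shifted row would assign values to the wrong measure.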

Data Analysis techniques and activities


Depending on the purpose or objective of the study, data analysis can
investigate particular characteristics of the data and explore relationships
between the measures. Several statistical techniques (see figure 3) are
presented in this manual for that purpose (software metrics analysis).


[Figure 3 diagram. M-System Data, the selection of the variables of interest and
the setting of hypotheses feed the Data Analysis. Boxes shown, from top to
bottom: Descriptive Statistics, Principal Component Analysis, Univariate
Analysis (Logistic / Poisson Regression) and a further Logistic / Poisson
Regression step. The legend marks optional steps and validity threats.]


Figure 3. Process of analysing M-System data.


Figure 3 shows the whole process for analysing M-System Data. This process is
proposed in [1][2][3] as a precise, complete and repeatable analysis procedure
which, when followed in replicated studies, enables comparisons to be made
across them.


The text boxes show the analysis options. The steps should be applied from top
to bottom (see figure 3): descriptive statistics, then, optionally, correlation
with size, then PCA, univariate analysis, multivariate analysis, fit,
cross-validation and model application.


An overview of the process is presented below.

Descriptive Statistics: give a first idea of how the data are distributed.
These results can guide the selection of the technique to be used.

Correlation to Size: investigating the relationship between the design measures
and the size of the classes helps to find out whether a measure (coupling,
cohesion, etc.) is actually measuring size ([1],[2]). This helps to understand
its relationship with the quality attributes under study (effort,
fault-proneness, etc.).

Principal Component Analysis (PCA): a statistical technique used to identify
the number of dimensions captured by the product measures and to reduce the
number of measures used in later techniques.

Univariate Analysis: investigates each individual measure against the quality
attribute under study (dependent variable).

Multivariate Analysis: investigates how well multiple measures, used in
combination, predict the dependent variable.

Fit: evaluates the accuracy of the prediction models and gives an estimate of
how good the solution is.

Cross-Validation: estimates the accuracy when the model is used on a different
data set.

Model Application: use of the prediction models in practice (during the
software development process) and evaluation of their performance.

Validity threats: it is desirable to validate the results and to consider the
possible risks taken into account during the study.


Macro for Statistical Analysis of M-System Data


[Figure 4 screenshot: control panel "Statistical Analysis of M-System Data" with
five groups: Descriptive Statistics & Outliers (1), with the buttons Summary,
Outliers and Load files; Principal Component Analysis (2), with the button PCA
Run and the option Minimum eigenvalue = 1; Univariate Analysis (3), with the
button Logistic Regression; Multivariate Analysis (4), with the button Logistic
Regression; and Cross-Validation (5), with the button Run CV. The fields N> (set
to 6) and alpha (set to 0.05) hold the analysis parameters.]


Figure 4. Form for data analysis


To support the data analysis process, a macro has been implemented in
STATISTICA. Figure 4 presents the control panel for this macro and shows the
functionality for supporting the data analysis process. Details about
installation and usage can also be found in the appendix described in section 7.


The application (see figure 4) provides all the functionality needed and
supports the complete process (see figure 3).


Note that although some steps in the process (also identified by numbers in the
form) can be applied without any previous treatment of the data, steps
4 (Multivariate Analysis) and 5 (Cross-Validation) require the results of the
earlier steps and some prior analysis.


3 Data analysis process


3.1 Reading the data files

The previous chapter already noted that the data set (from M-System) is the
starting point for the analysis. The following steps describe how to read the
files into the STATISTICA Macro.


1) Click the button “Load Files”. 



Figure 5. Control Panel for the data analysis 


2) Select the files involved in the analysis. 


Figure 6. Open data file dialog


3) Select a file and click OK.


[Figure 7 screenshot: dialog "Open M-System Data Files" with one file field per
group of measures, Coupling, Cohesion (.coh), Inheritance (.inh), Complexity
(.cplx) and Size (.size), marked (1), and a field "DV file", marked (2),
together with the buttons OK and Cancel.]


Figure 7. Open dialog forms.


At least two files must be loaded (they can be the same file): one for the
independent variables ((1) in figure 7) and one for the dependent variable ((2)
in figure 7).


Click Ok. 


4) To open each of the selected files (see figure 7), the following two dialogs
have to be answered.


[Figure 8 screenshots: two import dialogs for the file mmm2.coh.dump. The first
offers the import modes Auto, Free and Fixed. The second offers the field
separators Tab, Space, Comma, Semicolon and User-defined, the row at which the
import starts, a text qualifier, and the check boxes "Get case names from first
column", "Get variable names from first row", "Treat consecutive separators as
one" and "Trim leading spaces", together with a preview of the file contents
(columns sys, class and the measures). Annotation: click the check box only if
the first row contains the names of the variables.]


Figure 8. Options. Open data file dialogs. 


The control panel (see figure 5) is displayed again on the screen. 


3.2 Selection of the variables on behalf of the hypotheses 


When the characteristics of the data are investigated, or when the relationships
between measures and other factors (attributes) are explored, the goals of the
study should be reflected in the hypotheses and in the selection of variables.
Hypotheses are statements that explain the behaviour to be explored. They are
assertions about the relationships among the variables under investigation. An
example of a hypothesis is presented in example 1.


Example 1: A class with low cohesion is more likely to be fault-prone than a 
class with high cohesion. Low cohesion indicates inappropriate design, which is 
likely to be more fault-prone [1]. 


Note: the hypotheses should be set during the quality definition phase (see
figure 1). Once the set of hypotheses is stated, the set of variables of
interest that affect that set of hypotheses must be defined. A variable is some
characteristic or property that differs in value from one case to another.
Variables can be divided into two types: independent variables, also called
predictors, and dependent variables, which are the consequences of the
independent variables.


Example 2: In example 1, cohesion can be taken as the independent variable and
fault-proneness as the dependent variable. The relationship might, for example,
be hypothesised as: lower cohesion results in higher fault-proneness.


In this case, the independent variables are the M-System data collected in the
previous phase, that is, the different measures (cohesion, inheritance, etc.)
extracted from the system. It is the task of the user to decide which measures
of interest, and which quality attribute as dependent variable, take part in the
study. Usually, quality attributes such as effort, cost, fault-proneness, etc.
are chosen as dependent variables for the study.


Examples of dependent variables [11]:


— Fault proneness 

— Development effort 
— Test effort 

— Rework effort 

— Reusability 

— Maintainability 


Investigating the relationships among the measures and how they influence the
dependent variable is the main purpose of the data analysis.

Determine statistical inference 

The logic of statistical inferences is based on two possible outcomes [5]: 


1) Null Hypothesis H0: a statement about the state of the world that is to be
rejected.


Example 3: From example 1, a class with low cohesion is not more likely to be
fault-prone than a class with high cohesion.


2) Alternative Hypothesis H1: a statement about the state of the world that is
to be accepted.


Example 4: From example 1, a class with low cohesion is more likely to be
fault-prone than a class with high cohesion.


The null as well as the alternative hypothesis should be stated clearly so that
they can be evaluated at the end of the study [10].


To decide whether to reject the null hypothesis, the statistical significance
(α-level) is evaluated. Alpha is the probability of rejecting the statistical
hypothesis tested (H0) when, in fact, that hypothesis is true. Usually, α is set
at 0.05 (5%) or 0.01 (1%). That is, if the obtained p-value is lower than α
(equivalently, the result holds at a confidence level of at least 0.95 or 0.99),
the results of the statistical test are significant and H0 is rejected.
Otherwise (the p-value is equal to or higher than α), the results of the
statistical test are not significant (they may be due to chance).


Figure 9 shows the decision table for testing. β is the probability of failing
to reject the hypothesis tested when that hypothesis is false and a specific
alternative hypothesis is true. For a given test and sample, the value of beta
depends on the previously chosen value of alpha: all else being equal, lowering
α increases β.


— Accepting the null hypothesis when it is false: Type II error.
— Rejecting the null hypothesis when it is true: Type I error.


Therefore, alpha is the probability of committing a Type I error. The
probability of a Type II error is equal to beta; 1-β is called the power of the
statistical test. If the test results indicate that H0 should be rejected, then
H1 is accepted, but if the test does not reject H0, it can only be stated that
there are no grounds to reject H0.


             H0 True             H0 False
Reject H0    Type I error: α     No error: 1-β
Accept H0    No error: 1-α       Type II error: β


Figure 9. Decision table for testing.
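The decision logic of the table can be written down in a few lines. The sketch below is illustrative only and is not part of the STATISTICA macro; the function names decide and outcome are hypothetical, and the rule assumes that the test reports a p-value that is compared with the chosen α.

```python
def decide(p_value, alpha=0.05):
    """Apply the significance rule: reject H0 only if p is below alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return "reject H0" if p_value < alpha else "fail to reject H0"


def outcome(h0_is_true, h0_rejected):
    """Classify a decision according to the decision table for testing."""
    if h0_is_true and h0_rejected:
        return "Type I error"   # occurs with probability alpha
    if not h0_is_true and not h0_rejected:
        return "Type II error"  # occurs with probability beta
    return "no error"


print(decide(0.03))              # tested at alpha = 0.05
print(decide(0.03, alpha=0.01))  # same p-value, stricter alpha
print(outcome(h0_is_true=True, h0_rejected=True))
```

The second call shows why α has to be fixed before the analysis: the same p-value leads to different decisions under α = 0.05 and α = 0.01.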

3.3 Select appropriate methods


Many statistical techniques for data analysis are presented in the literature.
This manual focuses on those that are of interest for investigating software
measurement data.


Since the type of distribution determines the statistical technique to be
applied, the nature and distribution of the data must be explored in advance. In
this manual, a non-normal distribution is assumed, because many studies in the
literature ([1],[2]) show that the data are not normally distributed.


Confirming a theory
    Normal
        2 groups: Student's t-test
        >2 groups: F statistic
    Non-Normal: Kruskal-Wallis

Exploring a relationship
    Baseline: Box plot, Scatter Diagrams
    Measure of association (statistical confirmation with correlation analysis)
        Normal: Pearson
        Non-Normal
            Not tied: Spearman, Kendall
            Tied: chi-squared
    Equation
        Normal
            2 variables: linear regression
            >2 variables: multivariate regression
        Non-Normal: Logarithmic transformation, Theil


Figure 10. Decision tree for analysis techniques. [5], page 20 


Note, however, that non-parametric techniques have lower statistical power than parametric ones.
When a sample is normally distributed, the statistical power of the non-parametric test will be lower than
that of the corresponding parametric test and, as a consequence, a Type II error (Section 2.3) is more
likely to be committed [10].


There are many statistics books that can help to decide which technique to
apply [17]. Figure 10 shows the decision tree [5].


Words in bold italics mark the statistical techniques on which the techniques
presented in this manual are based.


The following sections present the techniques selected to analyse the measures
extracted from M-System. Each of them contains:


— Description and brief introduction to the statistical technique. 
— Example using the Macro. 


3.3.1 Perform Descriptive Statistics & Outlier Analysis 


During this activity, descriptive statistics such as medians, means,
interquartile ranges and standard deviations of each measure are calculated to
characterise the measures. Descriptive statistics are also important for
identifying measures with low or zero variance. This can help explain, for
example, why a measure is not useful in the later steps.


Medians and interquartile ranges are more appropriate than means and variances
because data sets in software measurement are usually not normally distributed.


Measures of central tendency: mean (arithmetic average) and median (middle
value).


Measures of distribution: standard deviation, interquartile range (range
containing the middle 50% of data points), minimum/maximum value
(smallest/largest value).


All these measures can be presented in a visual form called a box plot [5].
Normally, only measures with more than five non-zero data points are considered
for all subsequent analyses [1].
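The statistics named above can be computed with the Python standard library, as the following sketch illustrates. It is not the macro's implementation: the function name describe and the sample values are hypothetical, and the quartiles use the "inclusive" method of statistics.quantiles, which is only one of several common quartile definitions and may differ slightly from the values STATISTICA reports.

```python
import statistics


def describe(values):
    """Descriptive statistics for one measure (one value per class)."""
    lower, _, upper = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std_dev": statistics.stdev(values),  # sample standard deviation
        "lower_quartile": lower,
        "upper_quartile": upper,
        "iqr": upper - lower,                 # middle 50% of the data points
        "minimum": min(values),
        "maximum": max(values),
        "non_zero": sum(1 for v in values if v != 0),
    }


# Hypothetical values of one cohesion measure for eight classes.
lcom = [0, 0, 1, 2, 3, 5, 8, 44]
summary = describe(lcom)
for name, value in summary.items():
    print(f"{name:15s} {value}")
```

For skewed data such as this sample, the single large value pulls the mean well above the median, which is the reason the text prefers medians and interquartile ranges.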


Outliers are data points that are located outside the upper and lower values in
box plots. It is important to consider the inclusion or exclusion of outliers
because they can have a large influence on the analysis. Two kinds of outliers
are considered here [1]:


— Univariate outliers: a class that shows an extreme value in the distribution
of at least one of the measures used in the study. A univariate outlier is
influential if the significance of the relationship between the measure and
the dependent variable depends on the absence or presence of the outlier.

— Multivariate outliers: to identify multivariate outliers in the sample space
formed by n measures, the Mahalanobis distance (see [3.1]) is calculated for
each data point. This measure indicates whether or not an observation is an
outlier with respect to the values of the independent variables. Multivariate
outliers are data points with a large distance from the sample space centroid
[11]. A multivariate outlier may be over-influential if the significance of any
of the variables in the model depends on the absence or presence of the outlier.


$$ r(x) = (x - m)^{T} C^{-1} (x - m) $$    [3.1]

Where:


x is the vector of raw data for the measures (independent variables)
m is the vector of means of the independent variables
C^{-1} is the inverse of the covariance matrix of the measures.
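Equation [3.1] can be evaluated directly for small data sets. The sketch below is one possible pure-Python reading of it and not the routine used by STATISTICA: it assumes that C is the sample covariance matrix (divisor n-1) and inverts it by Gauss-Jordan elimination. All function names and the sample data are hypothetical.

```python
def mean_vector(rows):
    """Vector m: the mean of every measure over all classes."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def covariance_matrix(rows, means):
    """Sample covariance matrix C of the measures (divisor n - 1)."""
    n, dim = len(rows), len(means)
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(dim)] for i in range(dim)]


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def mahalanobis(x, means, inv_cov):
    """Quadratic form (x - m)^T C^-1 (x - m) of equation [3.1]."""
    d = [xi - mi for xi, mi in zip(x, means)]
    dim = len(d)
    return sum(d[i] * inv_cov[i][j] * d[j]
               for i in range(dim) for j in range(dim))


# Hypothetical values of two measures for seven classes.
classes = [(3, 10), (4, 12), (5, 15), (6, 16), (4, 11), (5, 14), (30, 2)]
m = mean_vector(classes)
inv_c = invert(covariance_matrix(classes, m))
distances = [mahalanobis(row, m, inv_c) for row in classes]
for case, distance in sorted(enumerate(distances, start=1),
                             key=lambda item: item[1], reverse=True):
    print(case, round(distance, 3))
```

Sorting the cases by decreasing distance puts the candidates for multivariate outliers first. A measure with zero variance, or one that is an exact linear combination of others, makes the covariance matrix singular, so such measures have to be dropped before the distance is computed.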


The results of the descriptive statistics are important because they allow
comparison with replicated studies, so researchers can determine, for example,
whether the data collected across studies stem from similar populations [1].


Example 


1) Click the button “Summary” to get the descriptive statistics of the data.
This is the first step of the data analysis and must always be done. This
option generates the spreadsheet (named “metrics selected”) used in the later
options of the analysis.


[Figure 11 screenshot: control panel "Statistical Analysis of M-System Data"
with the button Summary in the group Descriptive Statistics & Outliers (1).]


Figure 11. Control Panel


2) Select the set of independent variables and click OK.


[Figure 12 screenshot: dialog "Variables for Analysis" listing the numbered
measures (e.g. 21-FCAEC, 30-FMMIC) with the field "Independent Variable/s".]


Figure 12. Dialogs for selection of variables 


3) Select the dependent variable for the study and click ok. 


[Figure 13 screenshot: dialog "Dependent Variable for Regression" listing the
numbered measures from 11-RFC_oo to 30-FMMIC with the field "Dependent
Variable/s" and the buttons Cancel, Select All, Spread and Zoom.]


Figure 13. Dialog for selection of the dependent variable


3) Select the dependent variable (counts) for the study and click ok. 


[Figure 14 screenshot: dialog "Dependent Variable for Regression (Counts)"
listing the numbered measures from 11-RFC_oo to 30-FMMIC.]


Figure 14. Dialog for selection of the dependent variable (Counts)


Note that all variables involved in the data analysis must be specified: the
independent variables and the dependent variables (a binary variable (1,0) and
counts).


4) The following Workbook shows the results for Descriptive Statistics.


[Figure 15 screenshot: workbook with the spreadsheet "Results Descriptive
Statistics", showing for each metric the columns Mean, Median, Minimum, Maximum
and Lower Quartile, among others; an annotation points out the name of the
analysis and the results.]


Figure 15. Results for Descriptive Statistics

The results (see figure 15) show, for each metric, the mean, median, minimum and
maximum values, the quartiles, the standard deviation and Yes/No depending on
whether the metric has more than N non-zero data points.

By default, the minimum number of non-zero data points is 6; this number can be
modified in the control panel.
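The Yes/No column amounts to a simple filter on the number of non-zero data points. The following sketch shows one way to express it; it is not the macro's code, the names has_enough_data and select_metrics are hypothetical, and it assumes that a metric passes when it has at least min_nonzero non-zero values (six by default, i.e. more than five). Whether the macro compares with "at least" or "more than" the number entered in the control panel is not stated in the text.

```python
def has_enough_data(values, min_nonzero=6):
    """True if the metric has at least min_nonzero non-zero data points."""
    return sum(1 for v in values if v != 0) >= min_nonzero


def select_metrics(metrics, min_nonzero=6):
    """Keep only the metrics that pass the non-zero restriction."""
    return {name: values for name, values in metrics.items()
            if has_enough_data(values, min_nonzero)}


# Hypothetical metric values for eight classes.
metrics = {
    "RFC": [12, 7, 0, 33, 5, 9, 14, 2],
    "NMO": [0, 0, 0, 1, 0, 2, 0, 0],
}
selected = select_metrics(metrics)
for name in metrics:
    print(name, "Yes" if name in selected else "No")
```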

5) Univariate and multivariate outliers are calculated by clicking the button
“Outliers”.


[Figure 16 screenshot: control panel "Statistical Analysis of M-System Data"
with the button Outliers in the group Descriptive Statistics & Outliers (1).]


Figure 16. Outliers Option


6) Select the variables for the outlier analysis.


[Figure 17 screenshot: dialog "Variables for Analysis (Outliers)".]


Figure 17. Dialog for the selection of variables (Outliers) 


7) The results for univariate and multivariate outliers are shown in the
following figures.


[Figure 18 screenshot: workbook "Data Analysis Results - Univariate &
Multivariate Outliers" with the box plot (Metrics selected, 67 variables, 83
cases) of the selected variables, among them numpara, NIH-ICP, DMMEC, OCMEC_L,
LCC, NMO, NMpub, RFC_1, ACMIC, RFC_oo_L, LCOM3, CLD and NAL. The legend
distinguishes the median, the 25%-75% range, the non-outlier range and the
outliers.]


Figure 18. Box Plots Results 


Figure 18 displays the box plots for the variables selected. The variables for
univariate outliers can be chosen one by one for better visualisation.


[Figure 19 screenshot: spreadsheet "Mahalanobis Distances2" in the workbook
"Data Analysis Results", titled Predicted & Residual Values (Metrics selected),
with one row per case and the columns Case No., Observed Value, Predicted Value,
Residual, Standard Pred. v., Standard Residual, Std. Err. Pred. Val.,
Mahalanobis Distance, Deleted Residual and Cook's Distance, listed by case
number.]


Figure 19. Results for Outliers (Multivariate)


[Figure 20 screenshot: the same spreadsheet columns, with the cases sorted by
distance.]


Figure 20. Results for Outliers (Multivariate)


Figures 19 and 20 show the Mahalanobis distances of the cases that are
candidates (outliers) to be removed in later analyses. Figure 20 shows the
results ordered by distance, Figure 19 by case. The higher the Mahalanobis
distance, the more likely the associated case is to be an outlier.


For each execution of the macro, the results are stored in a different Workbook.
That is, only the results of the same execution session are stored in the same
Workbook. STATISTICA provides many utilities to copy, paste, save, etc.
spreadsheets.


Clicking the button “Summary” generates the spreadsheet named “metrics
selected”. This spreadsheet contains the metrics that remain after applying the
restriction “> N non-zero data points”.


The user may save all results in the desired format (text file, Excel file,
etc.).


3.3.2 Principal Component Analysis 


Introduction 


The measures in the data set (M-System Data) could be strongly correlated. In
this case they are likely to measure the same underlying dimension (quality
factors).


Principal component analysis (PCA) is a technique for the analysis of
multivariate data sets that provides a method for data reduction. In this case,
PCA identifies the underlying, orthogonal dimensions that explain the
relationships among the measures, and it provides a technique to reduce the
number of metrics that have to be dealt with.


PCA is applied to the data set (M-System Data):


— to determine whether the measures capture different dimensions and
structural aspects and, if not,

— to interpret what the individual measures are really capturing and, in that
way,

— to identify the redundancy that is present among the existing measures and

— to reduce the number of independent variables to a set that captures the
same properties as the original ones.


Model 


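Principal Components Analysis is a transformation from the observed variables
(measures) $x_1, \dots, x_p$ to new variables $y_1, \dots, y_p$ where

$$
\begin{aligned}
y_1 &= a_{11} x_1 + a_{12} x_2 + \dots + a_{1p} x_p \\
y_2 &= a_{21} x_1 + a_{22} x_2 + \dots + a_{2p} x_p \\
&\;\;\vdots \\
y_p &= a_{p1} x_1 + a_{p2} x_2 + \dots + a_{pp} x_p
\end{aligned}
\qquad [3.2]
$$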


Principal components (PCs) are linear combinations of the standardized inde- 
pendent variables. The sum of the squares of the coefficients of the standard- 
ized variables in one linear combination is equal to one. 
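To make the linear-combination view concrete, the sketch below approximates the coefficients of the first principal component of a small data set. It is an illustration only and does not reproduce the STATISTICA procedure: it uses power iteration on the correlation matrix, which assumes that the starting vector is not orthogonal to the first component and that no measure has zero variance. The function names and the sample values are hypothetical.

```python
import math
import statistics


def standardize(column):
    """Rescale one measure to mean 0 and standard deviation 1."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(value - mean) / sd for value in column]


def correlation_matrix(columns):
    """Correlation matrix of the measures (one list of values per measure)."""
    z = [standardize(column) for column in columns]
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z]
            for zi in z]


def first_component(corr, iterations=200):
    """Approximate the first PC of a correlation matrix by power iteration."""
    p = len(corr)
    coeffs = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        product = [sum(corr[i][j] * coeffs[j] for j in range(p))
                   for i in range(p)]
        norm = math.sqrt(sum(v * v for v in product))
        coeffs = [v / norm for v in product]  # keeps sum of squares at 1
    eigenvalue = sum(coeffs[i] * corr[i][j] * coeffs[j]
                     for i in range(p) for j in range(p))
    return eigenvalue, coeffs


# Hypothetical values of three measures for six classes.
measures = [
    [2, 4, 5, 7, 9, 12],
    [1, 3, 4, 8, 9, 11],
    [6, 2, 7, 3, 5, 4],
]
eigenvalue, coeffs = first_component(correlation_matrix(measures))
print("eigenvalue:", round(eigenvalue, 3))
print("coefficients:", [round(c, 3) for c in coeffs])
print("sum of squares:", round(sum(c * c for c in coeffs), 6))
```

The value of the first component for one class is obtained by multiplying each standardized measure of that class by its coefficient and adding the products, as in the first line of [3.2].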


Principal components (PCs, $y_1, \dots, y_p$) are calculated as follows:


— The first PC ($y_1$) is the linear combination of all standardized variables
that explains a maximum amount of variance in the data set. The second and
subsequent PCs are linear combinations of all standardized variables, where
each new PC is orthogonal to all previously calculated PCs and captures a
maximum of variance under these conditions. Usually, only a subset of all
variables have large coefficients (also called the loadings of the variables)
and therefore contribute significantly to the variance of each PC. The
variables with high loadings help to identify the dimension the PC is
capturing, but this usually requires some degree of interpretation.


— In order to identify these variables, and interpret the PCs, the rotated 
components are considered. This is a technique where the PCs are sub- 
jected to an orthogonal rotation. As a result, the rotated components 
show a clearer pattern of loadings, where the variables either have a very 
low or high loading, thus showing either a negligible or a significant im- 
pact on the PC. There exist several strategies to perform such a rotation. 
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Figure 21: 


Figure 22: 


Here the varimax rotation is used, which is the most frequently used 
strategy in the literature. For interpretation of the rotated components 


Data analysis process 


(RC), variables with high loadings (>0.7 or <-0.7 is a useful threshold) in 


each RC and identify what these measures have in common, are consid- 


ered. 


— For a set of n measures there are, at most, n orthogonal PCs, which are
calculated in decreasing order of the variance they explain in the data set.
Associated with each PC is its eigenvalue, which is a measure of the variance
of the PC. Usually, only a subset of the PCs is selected for further analysis
(interpretation, rotated components, etc.). A typical stopping rule, which can
be selected in the macro, is that only PCs whose eigenvalue is larger than 1.0
are selected [1] [2].
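The calculation described above can be sketched outside the macro as follows. This is a minimal illustration in Python with NumPy, not the macro's implementation; the function name `pca_on_correlation` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def pca_on_correlation(X, min_eigenvalue=1.0):
    """Return (eigenvalues, coefficients, loadings of the retained PCs)."""
    R = np.corrcoef(X, rowvar=False)          # PCA of standardized variables
    eigvals, eigvecs = np.linalg.eigh(R)      # ascending order for symmetric R
    order = np.argsort(eigvals)[::-1]         # re-sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > min_eigenvalue           # stopping rule: eigenvalue > 1.0
    # Loading = coefficient * sqrt(eigenvalue), i.e. the correlation
    # between a standardized variable and the PC.
    loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])
    return eigvals, eigvecs, loadings

# Hypothetical data: two measures driven by a common factor, one independent.
rng = np.random.default_rng(0)
common = rng.normal(size=200)
X = np.column_stack([
    common + 0.1 * rng.normal(size=200),
    common + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])
eigvals, coeffs, loadings = pca_on_correlation(X)
print(np.round(eigvals, 3))
print(np.round(loadings, 3))
```

Because the PCs are computed from the correlation matrix, the eigenvalues sum to the number of measures, and the squared coefficients of each PC sum to one.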


The next section shows how to apply PCA to the data set with the macro.


How to apply PCA


1) Click the button “PCA Run”. 


Figure 21. Control Panel, PCA option


2) Select the variables for the analysis. 


Figure 22. Dialog for the selection of variables


The results are presented in the following figure.


Figure 23. Results for PCA: factor loadings (varimax normalized, extraction:
principal components; marked loadings are > 0.70) of the selected metrics on
Factors 1 to 6, followed by these summary rows:

              Factor 1   Factor 2   Factor 3   Factor 4   Factor 5   Factor 6
Eigenvalue    21.91070   7.715724   7.250841   4.711363   4.171240   3.478174
% Total       33.70878  11.870345  11.155140   7.248281   6.417292   5.351036
Cumulative %  33.70878  45.579120  56.734260  63.982541  70.399833  75.750870


The results are presented in figure 23. For each measure, the table shows its
loading on each factor (see section 3.3.2 - Model). The last rows show the
eigenvalue of each factor, the percentage of the variance in the data set that
it explains, and the cumulative percentage. Loadings higher than 0.7 are
highlighted.


Restrictions: If the number of metrics selected in figure 22 is small, the
option “minimum eigenvalue = 1” must be deselected.
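To illustrate how rotated loadings and the 0.7 marking can be obtained, the following sketch applies a raw varimax rotation to a small loadings matrix. It is an assumption-laden illustration: the STATISTICA output is labelled "Varimax normalized", whereas this sketch omits the normalization step, and the loadings matrix `unrotated` is invented for the example.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Raw varimax rotation (no Kaiser normalization) of a loadings matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient of the varimax criterion; its SVD yields the next rotation.
        col_ss = np.diag((rotated ** 2).sum(axis=0))
        grad = loadings.T @ (rotated ** 3 - rotated @ col_ss / p)
        u, s, vt = np.linalg.svd(grad)
        rotation = u @ vt                     # orthogonal by construction
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):
            break
        criterion = new_criterion
    return loadings @ rotation

# Hypothetical unrotated loadings of four measures on two PCs.
unrotated = np.array([[0.7, 0.5], [0.8, 0.4], [0.6, -0.6], [0.5, -0.7]])
rotated = varimax(unrotated)
marked = np.abs(rotated) > 0.7                # loadings that would be highlighted
print(np.round(rotated, 3))
print(marked)
```

Since the rotation is orthogonal, the share of each measure's variance explained by the retained components (the row sums of squared loadings) is unchanged; only its distribution over the components changes.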


3.3.3. Correlation of measures to size - Optional 


The main reason why the correlation of measures (coupling, cohesion, etc.) to
size was analysed in the past was to investigate whether those measures
actually measure something different from size. Size might have a great
influence on the dependent variable. If size measures (which are much simpler
to collect) were good predictors, then it would be preferable to use those
instead of the sophisticated coupling/cohesion measures. Nowadays, this
analysis is considered redundant with PCA. If the size measures are included
in the PCA, it can be detected whether one or more dimensions are
characterized by size and which measures contribute to the size dimensions.


3.3.4 Univariate Analysis 
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Introduction 
Univariate Analysis is used to construct a prediction model for each
individual measure against the dependent variable, to determine whether the
measure is statistically related to it. The objective is to investigate which
measures have an impact on the dependent variable. Dependent variables are the
result of the goal and/or hypothesis definition and should therefore be
provided during the Quality definition phase; see figure 1.


The choice of a regression analysis method depends on the nature of the
dependent variable at hand. Logistic and Poisson regression are useful for
effort data. Both regression methods have been applied to M-System data for
univariate analysis so far [1][3]. This section focuses mainly on Logistic
regression; Poisson regression has the advantage of an unbiased effort
prediction, but is also a very specific approach.


Logistic Regression 


Univariate logistic regression is a standard statistical technique that
applies the following equation:


$$\pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} \qquad [3.3]$$


— $\pi$ (a conditional probability) is the probability that an event occurs,
given the value of the measure; the objective is to assess the impact of each
measure on the dependent variable.

— $x$ is the design measure used as independent variable in the model
(also called a covariate of the logistic regression equation).

— The coefficients $\beta_0$ and $\beta_1$ are estimated through the
maximization of a likelihood function (maximum likelihood estimation).

— The curve between $\pi$ and $x_i$ (in the multivariate case, assuming that
all other $x_j$ are held constant) takes a flexible S shape which ranges
between two extreme cases:

1) When a variable is not significant, the curve approximates a horizontal
line, i.e., $\pi$ does not depend on $x_i$.


2) When a variable entirely differentiates the software parts with respect to
the dependent variable, the curve approximates a step function.


— The likelihood function is built in the usual fashion, i.e., as the product
of the probabilities of the single observations, which are functions of the
covariates (whose values are known in the observations) and the coefficients
$\beta_i$ (which are unknown). For mathematical convenience, $l = \ln[L]$, the
loglikelihood, is usually the function to be maximized. This procedure assumes
that all observations are statistically independent.

— To assess the impact of each measure on the dependent variable, the odds
ratio is used, because the regression coefficients $\beta_i$ cannot be easily
interpreted for this purpose [9]. The odds ratio $\psi(X)$ represents the
ratio between the probability that the event in the dependent variable occurs
and the probability that it does not occur, when the value of the measure is
$X$:


$$\psi(X) = \frac{\pi(X)}{1 - \pi(X)}, \qquad \Delta\psi = \frac{\psi(X + \sigma)}{\psi(X)} = e^{\beta_1 \sigma} \qquad [3.4]$$


where $\sigma$ is the standard deviation of the measure. Therefore,
$\Delta\psi$ represents the reduction/increase in the odds ratio when the
value $X$ increases by one standard deviation.


— To assess the statistical significance of each independent variable in the
model, a likelihood ratio chi-square test is used.


— Let $l = \ln[L]$ be the loglikelihood of the model given in equation 3.3,
and $l_i$ be the loglikelihood of the model without variable $x_i$. Assuming
the null hypothesis that the true coefficient of $x_i$ is zero, the statistic
$G = -2(l_i - l)$ follows a chi-square distribution with one degree of
freedom, $\chi^2(1)$. Therefore, the p-value $p = P(\chi^2(1) > G)$ is tested.
If $p$ is larger than some level of significance alpha, the observed change in
the loglikelihood may be due to chance, and $x_i$ is not considered
significant. If $p \le$ alpha, the observed change in the loglikelihood is
unlikely to be due to chance, and $x_i$ is considered significant.
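The estimation, odds ratio and significance test described in the list above can be sketched as follows. This is an illustrative Python/NumPy implementation under stated assumptions, not the macro's code: the helper names `fit_logistic` and `lr_test_pvalue`, the Newton-Raphson fitting loop and the synthetic coupling measure are all introduced for the example.

```python
import math
import numpy as np

def fit_logistic(X, y, n_iter=50):
    """Maximum likelihood fit by Newton-Raphson; returns (beta, loglikelihood)."""
    A = np.column_stack([np.ones(len(y)), X])      # prepend intercept column
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ beta))
        grad = A.T @ (y - p)                       # gradient of the loglikelihood
        hess = A.T @ (A * (p * (1 - p))[:, None])  # negative Hessian
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    p = 1.0 / (1.0 + np.exp(-A @ beta))
    ll = float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))
    return beta, ll

def lr_test_pvalue(ll_full, ll_reduced):
    """p = P(chi2(1) > G) with G = -2 (l_i - l), using erfc for one d.o.f."""
    G = max(-2.0 * (ll_reduced - ll_full), 0.0)
    return math.erfc(math.sqrt(G / 2.0))

# Hypothetical data: one coupling measure x, binary fault indicator y.
rng = np.random.default_rng(1)
x = rng.normal(loc=10, scale=3, size=300)
prob = 1 / (1 + np.exp(-(-5 + 0.5 * x)))
y = (rng.random(300) < prob).astype(float)

beta, ll = fit_logistic(x, y)                            # model of equation 3.3
_, ll_without = fit_logistic(np.empty((len(y), 0)), y)   # model without x
p_value = lr_test_pvalue(ll, ll_without)
delta_psi = math.exp(beta[1] * x.std(ddof=1))            # change in odds per sigma
print(beta, p_value, delta_psi)
```

A value of `delta_psi` above 1 means that the odds of the event increase when the measure increases by one standard deviation; a value below 1 means that they decrease.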


Assessing Fit 


To assess the fit of the model, the logistic regression $R^2$ is used (not to
be confused with the least-squares regression $R^2$). The higher $R^2$, the
higher the effect of the model's explanatory variables and the more accurate
the model. However, as opposed to the $R^2$ of least-squares regression, high
values of $R^2$ are rare in logistic regression.


$R^2$ is defined by the following ratio:


$$R^2 = \frac{LL_S - LL}{LL_S} \qquad [3.5]$$


where:


$LL$ is the loglikelihood obtained by Maximum Likelihood Estimation of the model
$LL_S$ is the loglikelihood obtained by Maximum Likelihood Estimation of a model
without any variables, i.e., with only $\beta_0$
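As a small worked illustration of this ratio, the sketch below computes the logistic regression R² from a vector of outcomes and predicted probabilities. The function names and the six data points are hypothetical; the probabilities are simply given rather than fitted, and the intercept-only model is represented by its known maximum likelihood solution (every observation is assigned the observed base rate).

```python
import numpy as np

def loglik(y, p):
    """Loglikelihood of binary outcomes y under predicted probabilities p."""
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def logistic_r2(y, p_model):
    ll = loglik(y, p_model)
    # Intercept-only model: the ML estimate predicts the base rate for everyone.
    ll_s = loglik(y, np.full(len(y), y.mean()))
    return (ll_s - ll) / ll_s

y = np.array([0, 0, 0, 1, 1, 1], dtype=float)
p = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 0.9])   # hypothetical model probabilities
r2 = logistic_r2(y, p)
print(round(r2, 3))
```

A model that predicts only the base rate yields R² = 0, and R² approaches 1 as the predicted probabilities approach the observed outcomes.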
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Details about Logistic Regression can be found in [9]. 
Poisson Regression — Optional step 


Poisson Regression is applied to the analysis of counts of events. In this
case (M-System data), the dependent variable $y$ is assumed to have a Poisson
distribution with parameter $\mu$:


$$P(y; \mu) = \frac{e^{-\mu} \mu^{y}}{y!}$$


where the parameter $\mu$ is the expected value and variance of $y$:
$\mu = E(y) = Var(y)$ (equidispersion).


For a given data set, the regression coefficients $\beta_0, \ldots, \beta_n$
can be estimated through maximization of the likelihood function.


For the data analysis, however, the negative binomial model is used instead.
The negative binomial model is an extension of the Poisson model which allows
the variance of the process to differ from the mean. That is,
$Var(y \mid X) > E(y \mid X)$, known as overdispersion, which is more commonly
found in practice [3].


$$P(y \mid X) = \frac{\Gamma(y + \nu)}{y!\,\Gamma(\nu)} \left(\frac{\nu}{\nu + \mu}\right)^{\nu} \left(\frac{\mu}{\nu + \mu}\right)^{y}, \qquad \mu = e^{\beta_0 + \beta_1 x_1 + \ldots + \beta_n x_n}$$


with $\nu > 0$, which is estimated along with the regression coefficients
$\beta_0, \ldots, \beta_n$ in a maximum likelihood estimation based on the
above probability distribution.


$\mu = E(y \mid X)$, but $Var(y \mid X) = \mu + \alpha\mu^2$, where $\alpha = \nu^{-1}$.


$\alpha$ is called the dispersion parameter. For $\alpha \to 0$, the negative
binomial model converges towards the Poisson model.
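The relationship between the two distributions can be checked numerically. The sketch below uses only the Python standard library; the function names and the parameter values (mean 3, dispersion 0.5) are chosen for illustration and are not taken from the M-System data.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability of observing count y with mean mu."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def negbin_pmf(y, mu, alpha):
    """Negative binomial probability with mean mu and dispersion alpha = 1/nu."""
    nu = 1.0 / alpha
    log_p = (math.lgamma(y + nu) - math.lgamma(y + 1) - math.lgamma(nu)
             + nu * math.log(nu / (nu + mu))
             + y * math.log(mu / (nu + mu)))
    return math.exp(log_p)

mu, alpha = 3.0, 0.5
counts = range(200)                       # the tail beyond 200 is negligible here
mean = sum(y * negbin_pmf(y, mu, alpha) for y in counts)
var = sum((y - mean) ** 2 * negbin_pmf(y, mu, alpha) for y in counts)
print(mean, var)                          # variance exceeds the mean (overdispersion)

# With a very small dispersion parameter the two models nearly coincide.
print(negbin_pmf(2, mu, 1e-6), poisson_pmf(2, mu))
```

With these values the variance is μ + αμ² = 3 + 0.5 · 9 = 7.5, compared with a variance of 3 under the Poisson model.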


According to all these assumptions, the next two sections show two examples 
with both techniques. 


Details about Poisson Regression can be found in [17].
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Example: Logistic Regression 


1) Click the button “Logistic Regression” from Univariate Analysis. 


Figure 24. Control Panel, Logistic Regression


2) Select the variables for the analysis: the dependent variable (left) and
the independent variables (right).


Figure 25. Dialogs for the selection of variables
Click OK.
Note that the two selections must not overlap.


3) The results for Logistic Regression are presented in the following figure. 


Copyright © Fraunhofer IESE 2003 


Figure 26. Results for Univariate Logistic Regression (columns: coefficient,
standard deviation, p-value, R², odds ratio).


For each metric, the results of applying equation [3.3] are presented: its
coefficient in the equation, the standard deviation, the p-value, R² and the
odds ratio as defined in this section (3.3.4). Significant p-values are
highlighted.


3.3.5 Multivariate Analysis. 


Introduction 


Multivariate Analysis is used to investigate the relationship between one de- 
pendent variable and two or more independent variables (measures). This 
analysis is conducted to determine how well multiple measures predict the de- 
pendent variable, when the measures are used in combination. 


Logistic Regression 


In Multivariate Logistic Regression, the same assumptions as in Univariate
Logistic Regression (see section 3.3.4) are taken into account. In this case
(two or more variables) the equation becomes:


$$\pi(x_1, \ldots, x_n) = \frac{e^{\beta_0 + \beta_1 x_1 + \ldots + \beta_n x_n}}{1 + e^{\beta_0 + \beta_1 x_1 + \ldots + \beta_n x_n}} \qquad [3.6]$$


where the coefficients, the statistical significance and the fit of the model
are calculated in the same way as in section 3.3.4.

Sometimes the number of measures (independent variables) is too large to be
handled. To reduce it and select the measures to be used in the model, a
strategy must be employed that

— Minimizes the number of independent variables in the model. Using too
many independent variables can increase the estimated standard error of the
model's prediction, making the model more dependent on the data set, i.e.,
less generalizable. A rule of thumb is to have at least ten data points per
independent variable in the model.


— Reduces multicollinearity, i.e., independent variables which are highly
correlated. This makes the model more interpretable. According to [9], tests
for multicollinearity used in least-squares regression are also applicable in
the context of logistic regression. Here the conditional number [1][19] is
applied, which is based on the correlation matrix of the covariates (measures)
in the model. The conditional number is defined as:


$$\text{conditional number} = \sqrt{\frac{l_{max}}{l_{min}}}$$


where

$l_{max}$ is the largest eigenvalue of the principal components.

$l_{min}$ is the smallest eigenvalue.

A large conditional number indicates the presence of multicollinearity. The
degree of multicollinearity is harmful, and corrective actions should be
taken, when the conditional number exceeds 30 [1].
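A minimal sketch of this check, assuming the conditional number is the square root of the ratio of the largest to the smallest eigenvalue of the correlation matrix of the covariates. The function name and the synthetic measures are hypothetical.

```python
import numpy as np

def conditional_number(X):
    """sqrt(l_max / l_min) of the correlation matrix of the columns of X."""
    R = np.corrcoef(X, rowvar=False)
    eigvals = np.linalg.eigvalsh(R)       # eigenvalues of the symmetric matrix R
    return float(np.sqrt(eigvals.max() / eigvals.min()))

rng = np.random.default_rng(3)
a = rng.normal(size=300)
b = rng.normal(size=300)

independent = np.column_stack([a, b])
# The second column is almost a copy of the first: strong multicollinearity.
collinear = np.column_stack([a, a + 0.01 * rng.normal(size=300), b])

print(conditional_number(independent))
print(conditional_number(collinear), "-> above 30, corrective action needed")
```

For two covariates with correlation r, the eigenvalues of the correlation matrix are 1 + |r| and 1 - |r|, so the conditional number grows without bound as |r| approaches 1.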


To build multivariate models (regardless of logistic/Poisson/OLS regression),
a stepwise selection process can be used, which successively adds variables to
or removes them from a model until its goodness of fit cannot be improved any
more. The two major stepwise selection processes used in logistic regression
are forward selection and backward elimination. The general forward selection
procedure starts with a model that includes the intercept only. Based on
certain statistical criteria, variables are selected one at a time for
inclusion in the model, until a stopping criterion is fulfilled. Because of
the large number of independent variables (M-System data), the forward
selection procedure is used to build the prediction models. In each step, all
variables not already in the model are tested: the most significant variable
is selected for inclusion in the model. If this causes a variable already in
the model to become not significant (at $\alpha_{exit} = 0.10$), it is deleted
from the model. The process stops when adding the best variable no longer
improves the model significantly (at $\alpha_{enter} = 0.05$). The
significance of a variable is tested by a loglikelihood ratio test (see
section 3.3.4).
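The forward selection procedure can be sketched as follows. This is an illustrative Python/NumPy version under several assumptions: the logistic model is fitted with a plain Newton-Raphson loop, the step limit that guards against cycling is an addition of this sketch, and all function names and the synthetic data are hypothetical. It is not the code of the macro.

```python
import math
import numpy as np

def loglik_logistic(X, y, n_iter=50):
    """Maximised loglikelihood of a logistic model with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ beta))
        step = np.linalg.solve(A.T @ (A * (p * (1 - p))[:, None]), A.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    p = 1.0 / (1.0 + np.exp(-A @ beta))
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def lr_pvalue(ll_full, ll_reduced):
    """Likelihood ratio test with one degree of freedom."""
    G = max(-2.0 * (ll_reduced - ll_full), 0.0)
    return math.erfc(math.sqrt(G / 2.0))

def forward_selection(X, y, alpha_enter=0.05, alpha_exit=0.10):
    cols = lambda idx: X[:, np.array(idx, dtype=int)]
    selected = []
    for _ in range(2 * X.shape[1]):              # step limit (assumption)
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        if not candidates:
            break
        ll_current = loglik_logistic(cols(selected), y)
        pvals = {j: lr_pvalue(loglik_logistic(cols(selected + [j]), y), ll_current)
                 for j in candidates}
        best = min(pvals, key=pvals.get)         # most significant candidate
        if pvals[best] > alpha_enter:
            break                                # no significant improvement
        selected.append(best)
        # Delete earlier variables that are no longer significant.
        for j in list(selected[:-1]):
            ll_full = loglik_logistic(cols(selected), y)
            reduced = [k for k in selected if k != j]
            if lr_pvalue(ll_full, loglik_logistic(cols(reduced), y)) > alpha_exit:
                selected.remove(j)
    return selected

# Hypothetical data: only the first of three measures drives the outcome.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
y = (rng.random(500) < 1 / (1 + np.exp(-1.5 * X[:, 0]))).astype(float)
selected = forward_selection(X, y)
print(selected)
```

The returned list holds the column indices of the measures kept in the model, in the order in which they entered.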
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Example: Logistic Regression 


1) Click the button “Logistic Regression” from Multivariate Analysis. 


Figure 27. Control Panel.


2) Select the variables involved in the analysis. Note that the previous
results must be interpreted in order to make this selection.


Figure 28. Dialogs for the selection of variables (dependent variable and
independent variable/s)
Click OK.


The results are presented in the following figures. 


Figure 29. Results for the model


Figure 30. Goodness of fit of the model


Two spreadsheets are generated for Multivariate Analysis. The first
(figure 29) shows the results for the model (equation 3.6): coefficient,
standard deviation and p-value. The heading of the spreadsheet shows the
values of R square, conditional number and loglikelihood for the model
generated. The second spreadsheet (figure 30) shows the goodness of fit of the
model: completeness, correctness and Kappa value, as defined in the next
section (3.3.6).


3.3.6 Evaluating estimation accuracy 




The accuracy of the prediction model (which describes the relationships
between independent and dependent variables) should be determined to provide
information about the probability that the data analysis outcomes are right.


Several techniques exist to determine the accuracy of prediction models:
Magnitude of Relative Error (MRE), Completeness, Correctness and Kappa. The
type of regression technique and the dependent variable determine the
suitability of the accuracy technique.


The following measures might be used to evaluate the goodness of fit of the 
prediction models: 


— Completeness: the degree to which all the parts of a software system or
component are present and each of its parts is fully specified and developed
[12]. Here, completeness is concerned with the percentage of classes that
fulfil a specific criterion.


— Correctness: the degree to which a system or component is free from
faults in its specification, design and implementation [13]. In this case,
correctness is concerned with the percentage of classes classified correctly
(according to a criterion).


— MRE (Magnitude of Relative Error) or ARE (Absolute Relative Error):


$$MRE_i = \frac{ARE_i}{eff_i}, \quad \text{where } ARE_i = |eff_i - \widehat{eff}_i|$$


where $eff_i$ is the actual effort and $\widehat{eff}_i$ the predicted effort
for data point $i$.


— $R^2$ (see section 3.3.4) is also used as a measure of the goodness of fit.
However, this measure is specific to Logistic Regression techniques based on
maximum likelihood estimation.


— Kappa [14] is a measure of the degree of agreement of two variables.
Appendix B elaborates on Kappa.
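As a short illustration of the MRE measure from the list above, the following sketch computes ARE and MRE for three data points. The function name and the effort values are invented for the example.

```python
import numpy as np

def mre(actual, predicted):
    """Magnitude of relative error per data point: |actual - predicted| / actual."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    are = np.abs(actual - predicted)      # absolute error of each prediction
    return are / actual

effort = [10.0, 20.0, 40.0]               # hypothetical actual effort
estimate = [12.0, 15.0, 40.0]             # hypothetical predicted effort
values = mre(effort, estimate)
print(values, values.mean())
```

For these values the MREs are 0.2, 0.25 and 0.0; their mean, 0.15, can serve as a single summary of the estimation accuracy.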


3.3.7. Cross-Validation 


Introduction 


The accuracy results of the prediction models might be optimistic because the
models are applied to the same data set from which they were derived. These
prediction models should therefore be applied to a different data set.


Cross-validation is a common method used for model checking in regression
problems that helps in this task.


A cross-validation is performed as follows:


— Divide the data set into k subsets of similar size (normally k = 10).

— Each time, one of the k subsets is used as the test set, and the other k-1
subsets are put together to form a training set. The model is re-fitted using
the k-1 subsets and then applied to the current test subset. The point is to
re-apply the techniques presented in Section 3.3.6 to the newly predicted
values.
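The cross-validation steps can be sketched generically as follows. The sketch is independent of the regression technique: `fit` and `predict` are caller-supplied functions, and all names as well as the trivial mean-value model used in the demonstration are assumptions of this example.

```python
import numpy as np

def k_fold_indices(n, k=10, seed=0):
    """Split the indices 0..n-1 into k random subsets of similar size."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

def cross_validate(X, y, fit, predict, k=10):
    """Predict every observation with a model fitted on the other k-1 subsets."""
    predictions = np.empty(len(y), dtype=float)
    for test_idx in k_fold_indices(len(y), k):
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        model = fit(X[train_idx], y[train_idx])          # re-fit on k-1 subsets
        predictions[test_idx] = predict(model, X[test_idx])
    return predictions

# Demonstration with a trivial model that always predicts the training mean.
y = np.arange(20, dtype=float)
X = y.reshape(-1, 1)
preds = cross_validate(
    X, y,
    fit=lambda X_train, y_train: y_train.mean(),
    predict=lambda model, X_test: np.full(len(X_test), model),
    k=10,
)
print(preds)
```

The accuracy measures of Section 3.3.6 are then computed on `preds` instead of on the predictions of a model fitted to the whole data set.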
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Examples: Cross-Validation 


1) Click the button “Run CV” in the Cross-Validation group.


Figure 31. Control Panel.


2) Select the variables involved in the analysis. Note that the previous
results must be interpreted in order to make this selection.


Figure 32. Dialog for selection of variables (dependent variable and
independent variable/s)


The results are provided in the following figure. 


Figure 33. Results for Cross-Validation.


The threshold for the cross-validation (see figure 33) is selected so that the
percentage of classes classified as fault-prone is roughly the same as the
percentage of classes that actually are fault-prone, i.e., the threshold
roughly balances the number of actual and predicted fault-prone classes.
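One simple way to obtain such a balancing threshold is sketched below. This is an assumption about how the selection could be done, not a description of the macro; the function name and the six example classes are hypothetical.

```python
import numpy as np

def balancing_threshold(probabilities, actual):
    """Threshold for which the number of predicted fault-prone classes
    roughly equals the number of actually fault-prone classes."""
    probabilities = np.asarray(probabilities, dtype=float)
    n_actual = int(np.sum(actual))
    if n_actual == 0:
        return 1.0                        # nothing should be classified fault-prone
    # Take the n_actual-th largest predicted probability as the cut-off.
    return float(np.sort(probabilities)[::-1][n_actual - 1])

p = [0.9, 0.1, 0.65, 0.3, 0.8, 0.2]       # predicted fault-proneness per class
faulty = [1, 0, 0, 0, 1, 1]               # 1 = class actually is fault-prone
t = balancing_threshold(p, faulty)
predicted = np.asarray(p) >= t            # classes classified as fault-prone
print(t, int(predicted.sum()), sum(faulty))
```

With tied probabilities the two counts may differ slightly, which is why the balance is only approximate.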
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3.3.8 Model Application 


The prediction models built in the last section (3.3.7) may be applied in
practice. The results obtained from the prediction models can be presented in
plots (for example, the classes can be sorted in decreasing order of their
predicted fault-proneness), so that when planning inspections during the
software development process, a trade-off can be made between the resources
spent on inspections and their effectiveness, and their performance can be
evaluated.
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Figure 34. 
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Summary 


The main goal of this manual was to describe the process of analysing M- 
System Data (Software Measurement data) using statistical techniques to inves- 
tigate the relationships between measures and quality attributes (fault- 
proneness, effort, etc). 


This process requires stating the goals, the variables of interest and the
hypotheses to be tested.


To analyse the data, a first study of the nature and distribution of the data
can yield conclusions that direct the analysis towards different techniques.


Descriptive Analysis and Outliers is the first step of this analysis. The
large number of metrics measured with M-System can be too much information to
analyse with statistical techniques. PCA (Principal Component Analysis)
provides a method to reduce the metrics to a smaller number of components that
represent the same information.


Univariate and Multivariate regression are applied to analyse the
relationships between the measures extracted and the dependent variable.
Poisson or Logistic Regression is applied according to the nature of the
dependent variable.


Finally, all models developed with the statistical techniques are validated.


A tool that supports the process presented has been developed. The following
table summarises the process.


What do you want to do?                | Number of variables       | Statistical technique
---------------------------------------|---------------------------|------------------------------------------
Distribution of data                   |                           | Boxplots, Descriptive Statistics
Investigate the correlation with size  |                           | Optional; PCA
Explore the relationships among        | One independent variable  | Univariate: Logistic Regression or Poisson
variables                              | Two or more independent   | Multivariate: Logistic Regression or
                                       | variables                 | Poisson
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6 Appendix A.1: Tool Usage 


Hardware & Software Requirement 
The macro used for the statistical analysis has been implemented under
STATISTICA. STATISTICA 6.1 from StatSoft ® [18] should therefore be installed
on your computer before starting the analysis.
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7 Appendix A.2: Statistica Macro 


The following describes the installation of the Statistica macro. 

1) Open STATISTICA Application 

2) Go to Open in the File menu or click the Open button in the navigation menu.
3) Load the file “Macro for statistical Analysis of M-System data.svb” 


4) To execute the macro, click in the 


How to Start 


1) Initialise the STATISTICA environment. 


Figure 35. STATISTICA environment


2) Go to Tools-Macro-Macros. 
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Figure 36. Options for load a macro. 
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3) Open or Run the macro called “Macro for Statistical Data Analysis of 
Code Measurement Data.svb” 


Figure 37. Macro form (Scripting: STATISTICA Visual Basic)


The source code of the macro is displayed (if opened).


Figure 38. Source Code


4) To execute the application, press F5, click the run icon, or go to
Tools-Macro-Macros and click Run.


Figure 39. Application form


Section 3.1 explains how to read the files. 
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Kappa [14] is a measure of the degree of agreement of two variables. Kappa 
is based on the following formula:


$$\kappa = \frac{P_o - P_c}{1 - P_c}$$


Figure 39. Kappa coefficient.


Where:


- $P_o = \frac{1}{N}\sum_{i=1}^{r} n_{ii}$ : observed level of agreement


- $P_c = \frac{1}{N^2}\sum_{i=1}^{r} n_{i\cdot}\, n_{\cdot i}$ : level of agreement expected by chance


                         Variable A
                 1      2      3     ...    r      Total
Variable B  1    n_11   n_12   n_13  ...    n_1r   n_1.
            2    n_21   n_22   n_23  ...    n_2r   n_2.
            ...
            r    n_r1   n_r2   n_r3  ...    n_rr   n_r.
        Total    n_.1   n_.2   n_.3  ...    n_.r   N
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Interpretation of k-values [15]: 


K-values Interpretation 
Below 0.00 Poor 
0.00-0.20 Slight 
0.21-0.40 Fair 
0.41-0.60 Moderate 
0.61-0.80 Substantial 
0.81-1.00 Almost perfect 


Kappa values (k) range between -1 and 1, the higher the value, the bet- 
ter the agreement. A Kappa zero indicates that the agreement is no bet- 
ter than what can be expected from chance. 


Example:

                          Predicted
                   π <= 0.75     π > 0.75      Σ
Actual  No fault   49 classes    31 classes    80 classes
        Fault      31 classes    15 classes    46 classes
        Σ          80 classes    46 classes    126 classes


Taking into account the table above, the Kappa coefficient is calculated as
follows:


$$P_o = \frac{1}{126}(49 + 15) = 0.508$$

$$P_c = \frac{1}{126^2}(80 \cdot 80 + 46 \cdot 46) = 0.536$$

$$\kappa = \frac{0.508 - 0.536}{1 - 0.536} = -0.06$$


The resulting Kappa value is -0.06. According to the interpretation of
k-values, this indicates poor agreement of the variables.
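The calculation of Kappa from a contingency table can be sketched as follows, using the standard definition with observed and chance agreement. The function name is hypothetical; the table is the one from the example above.

```python
import numpy as np

def kappa(table):
    """Kappa coefficient of a square contingency table (rows: actual, columns: predicted)."""
    n = np.asarray(table, dtype=float)
    N = n.sum()
    p_o = np.trace(n) / N                                     # observed agreement
    p_c = np.sum(n.sum(axis=1) * n.sum(axis=0)) / N ** 2      # chance agreement
    return (p_o - p_c) / (1 - p_c)

# Rows: no fault / fault (actual); columns: pi <= 0.75 / pi > 0.75 (predicted).
k = kappa([[49, 31], [31, 15]])
print(round(k, 3))
```

A table in which all observations lie on the diagonal gives a Kappa of 1, and a table whose cells match the products of its margins gives a Kappa of 0.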
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Appendix C: Validity threats 


It is important to consider the validity of the results. According to [8], results 
can be validated from four points of view: 


Conclusion validity (fig.8, 1): is concerned with the ability to draw correct
conclusions. In this case, it is affected by the selection of a particular
tool to extract the data set (M-System) and by the analysis process selected
(see fig. 2).


Selecting case studies with a small sample size, or making assumptions about
the distribution of the data in order to choose the “correct” statistical
test, can lead to results with a high margin of error or to conclusions which
are difficult to generalize.


Internal validity (fig.8, 2): The dependent variables may not only be
influenced by the independent variables. Internal validity is concerned with
the degree to which conclusions can be drawn about the causal effect of the
independent variables on the dependent variables [1].


Whether the relationship between the measures and the dependent variable is
due to causal effects or not is a risk to be studied.


Construct validity (fig.8, 3): The selection of the independent and dependent
variables (see Section 2.3) can also influence the conclusions. The selected
variables might not accurately measure the factors that they are intended to
measure.


External validity (fig.8, 4): The degree to which the results of the research
can be generalized to the population under study and to other research
settings [1].


Figure 40. Types of validity [8] (diagram labels: experiment objective;
theory level: cause construct, cause-effect construct, treatment construct;
observation level: independent variable (measures), dependent variable).
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